Large-scale Document Clustering for Associative Document Search

Authors

Makoto Iwayama

Takenobu Tokunaga

Abstract

Approximated algorithms for clustering large-scale document collections are proposed and evaluated in the context of cluster-based document retrieval (i.e., associative document search). These algorithms use a precise clustering algorithm as a subroutine to construct a stratified structure of cluster trees. In an experiment, the best case showed a speedup of more than 100 times in CPU time. Through experiments on self retrieval and topic assignment, we confirmed sufficient search performance on cluster trees constructed by the approximated algorithms. In particular, top-down construction offered over 99% accuracy in self retrieval, which is comparable to the performance of exhaustive search. Top-down construction also offered promising performance in topic assignment, namely better recall/precision than that obtained by exhaustive search. All of the results for cluster-based retrieval were obtained by simple and efficient binary tree search.
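
The binary tree search and the self-retrieval test mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the similarity measure or how the trees are built, so the sketch assumes cosine similarity over bag-of-words counts and a small hand-built tree. The names `Node`, `leaf`, `merge`, `tree_search`, and `self_retrieval_accuracy` are hypothetical, as is the toy data.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Optional


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Node:
    """Node of a binary cluster tree; a leaf holds exactly one document id."""
    centroid: Counter
    doc_id: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(doc_id: str, text: str) -> Node:
    """Build a leaf whose centroid is the document's own term counts."""
    return Node(centroid=Counter(text.lower().split()), doc_id=doc_id)


def merge(left: Node, right: Node) -> Node:
    """Build an internal node whose centroid is the summed counts of its children."""
    return Node(centroid=left.centroid + right.centroid, left=left, right=right)


def tree_search(root: Node, query: Counter) -> str:
    """Greedy descent: at each node follow the child most similar to the query."""
    node = root
    while node.doc_id is None:
        if cosine(query, node.left.centroid) >= cosine(query, node.right.centroid):
            node = node.left
        else:
            node = node.right
    return node.doc_id


def self_retrieval_accuracy(root: Node, leaves: Dict[str, Node]) -> float:
    """Fraction of documents that retrieve themselves when used as the query."""
    hits = sum(tree_search(root, n.centroid) == doc_id for doc_id, n in leaves.items())
    return hits / len(leaves)


# Toy data and a hand-built tree (hypothetical; a real tree would come from clustering).
docs = {
    "d1": "cluster tree search for document retrieval",
    "d2": "document clustering with cluster trees",
    "d3": "neural network image classification",
    "d4": "convolutional neural network training",
}
leaves = {doc_id: leaf(doc_id, text) for doc_id, text in docs.items()}
root = merge(merge(leaves["d1"], leaves["d2"]), merge(leaves["d3"], leaves["d4"]))
accuracy = self_retrieval_accuracy(root, leaves)
```

The search visits one node per tree level, so it compares the query against far fewer vectors than an exhaustive scan over all documents. The trade-off is that a wrong turn near the root cannot be undone, which is why self-retrieval accuracy is a natural check of tree quality.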

Similar resources

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Hierarchical Fuzzy Clustering Semantics (HFCS) in Web Document for Discovering Latent Semantics

This paper discusses the future of World Wide Web development, called the Semantic Web. Undoubtedly, the Web is one of the most important services on the Internet and has had the greatest impact on the spread of the Internet in human societies. Internet penetration has been an effective factor in the growth of the volume of information on the Web. The massive growth of informat...

Learning Document Image Features With SqueezeNet Convolutional Neural Network

The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state of the art model for this task. However, there are two major drawbacks for these classifiers: the huge computational power demand for...

A partition-based algorithm for clustering large-scale software systems

Clustering techniques are used to extract the structure of software for understanding, maintaining, and refactoring. In the literature, most of the proposed approaches for software clustering are divided into hierarchical algorithms and search-based techniques. In the former, clustering is a process of merging (splitting) similar (non-similar) clusters. These techniques suffered from the drawba...

Publication date: 2007
